Approximate Sparse Filtrations

In this module, we will explore an approximation algorithm that reduces the runtime of the persistence algorithm [1]. For more information on this algorithm, please see the following video:

https://www.youtube.com/watch?v=3WxuSwQhAgU

[1] Cavanna, Nicholas J., Mahmoodreza Jahanseir, and Donald R. Sheehy. "A geometric perspective on sparse filtrations." Proceedings of the Canadian Conference on Computational Geometry (CCCG 2015).

First, let's import the necessary libraries


In [ ]:
from ripser import ripser
from persim import plot_diagrams, wasserstein, wasserstein_matching
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances
from scipy import sparse
import time

Now, we define a "greedy permutation": a function which performs furthest point sampling, a key step in the algorithm used to determine the "insertion radii" $\lambda_i$ of each point (this is similar to a cover tree construction). For an animation of how this works, please visit:

https://gist.github.com/ctralie/128cc07da67f1d2e10ea470ee2d23fe8


In [ ]:
def getGreedyPerm(D):
    """
    A naive O(N^2) algorithm to do furthest point sampling
    
    Parameters
    ----------
    D : ndarray (N, N) 
        An NxN distance matrix for points

    Returns
    -------
    lambdas: ndarray (N)
        Insertion radii of all points
    """
    
    N = D.shape[0]
    #By default, takes the first point in the permutation to be the
    #first point in the point cloud, but could be random
    perm = np.zeros(N, dtype=np.int64)
    lambdas = np.zeros(N)
    ds = D[0, :]
    for i in range(1, N):
        idx = np.argmax(ds)
        perm[i] = idx
        #Insertion radii are indexed by each point's position in the cloud
        lambdas[idx] = ds[idx]
        ds = np.minimum(ds, D[idx, :])
    return lambdas
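
As a quick sanity check, here is what the insertion radii look like on a tiny handmade point cloud; the five collinear points below are an arbitrary choice for illustration.


In [ ]:
#Sanity check on five collinear toy points (an arbitrary example)
Xtoy = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 0.0], [2.5, 0.0], [7.5, 0.0]])
Dtoy = pairwise_distances(Xtoy, metric='euclidean')
#Starting from point 0, furthest point sampling picks x=10 (radius 10),
#then x=5 (radius 5), then x=2.5 and x=7.5 (radius 2.5 each); the seed
#point's radius is left at 0 by this convention
print(getGreedyPerm(Dtoy)) #Expected: [0, 10, 5, 2.5, 2.5]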

Now, we define a function which, given a distance matrix representing a point cloud, a set of insertion radii, and an approximation factor $\epsilon$, returns a sparse distance matrix with re-weighted edges, whose persistence diagrams in each dimension are guaranteed to be a $(1+\epsilon)$ multiplicative approximation of the true persistence diagrams (see [1]).
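
Concretely, writing $E_0 = (1+\epsilon)/\epsilon$ and $E_1 = (1+\epsilon)^2/\epsilon$ as in the implementation below (Algorithm 3 of [1]), an edge $\{p, q\}$ with $\lambda_{\min} = \min(\lambda_p, \lambda_q)$ and $\lambda_{\max} = \max(\lambda_p, \lambda_q)$ is kept only if $d(p, q) \leq \min\big((E_0 + E_1)\lambda_{\min},\, E_0(\lambda_{\min} + \lambda_{\max})\big)$, and each kept edge is assigned the warped weight

$$d'(p, q) = \begin{cases} d(p, q) & \text{if } d(p, q) \leq 2E_0\lambda_{\min} \\ 2\big(d(p, q) - E_0\lambda_{\min}\big) & \text{otherwise} \end{cases}$$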


In [ ]:
def getApproxSparseDM(lambdas, eps, D):
    """
    Purpose: To return a sparse distance matrix containing the pruned edges with their warped weights
    
    Parameters
    ----------
    lambdas: list
        insertion radii for points
    eps: float
        epsilon approximation constant
    D: ndarray
        NxN distance matrix; modified in place, so pass a copy if it is
        needed afterwards
    
    Returns
    -------
    DSparse: scipy.sparse.csr_matrix
        A sparse NxN matrix with the re-weighted edges
    """
    N = D.shape[0]
    E0 = (1+eps)/eps
    E1 = (1+eps)**2/eps
    
    #Create initial sparse list candidates (Lemma 6)
    nBounds = ((eps**2+3*eps+2)/eps)*lambdas #Search neighborhoods    
    D[D > nBounds[:, None]] = np.inf #Set all distances outside of search neighborhood to infinity
    [I, J] = np.meshgrid(np.arange(N), np.arange(N))
    idx = I < J
    I = I[(D < np.inf)*(idx == 1)]
    J = J[(D < np.inf)*(idx == 1)]
    D = D[(D < np.inf)*(idx == 1)]
    
    #Prune sparse list and update warped edge lengths (Algorithm 3 pg. 14)
    minlam = np.minimum(lambdas[I], lambdas[J])
    maxlam = np.maximum(lambdas[I], lambdas[J])
    #Rule out edges between vertices whose balls stop growing before they touch
    #or where one of them would have been deleted.  M stores which of these
    #happens first
    M = np.minimum((E0 + E1)*minlam, E0*(minlam + maxlam))
    
    t = np.arange(len(I))
    t = t[D <= M]
    (I, J, D) = (I[t], J[t], D[t])
    minlam = minlam[t]
    maxlam = maxlam[t]
    
    #Now figure out the metric of the edges that are actually added
    t = np.ones(len(I))
    t[D <= 2*minlam*E0] = 0 #If cones haven't turned into cylinders, metric is unchanged
    #Otherwise, if they meet before the M condition above, the metric is warped
    D[t == 1] = 2.0*(D[t == 1] - minlam[t == 1]*E0) #Multiply by 2 convention
    return sparse.coo_matrix((D, (I, J)), shape=(N, N)).tocsr()
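
To get a feel for how aggressive the pruning is, here is a minimal sketch on a small random point cloud; the cloud size, seed, and $\epsilon$ value below are arbitrary choices.


In [ ]:
#Sparsify a small random cloud (size, seed, and eps are arbitrary choices)
np.random.seed(0)
Xrand = np.random.randn(100, 2)
Drand = pairwise_distances(Xrand, metric='euclidean')
lambdasrand = getGreedyPerm(Drand)
#Note that getApproxSparseDM modifies Drand in place
DSparserand = getApproxSparseDM(lambdasrand, 0.5, Drand)
print("Kept %i of %i candidate edges"%(DSparserand.nnz, 100*99//2))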

Now let's set up a point cloud we can test this on, which has enough points for ripser to start slowing down a bit. We'll compute the full Rips filtration on this point cloud as a ground truth, and we will time it.


In [ ]:
NPoints = 2000
t = np.linspace(0, 2*np.pi, NPoints+1)[0:NPoints]
X = np.zeros((NPoints, 2))
X[:, 0] = np.cos(t)
X[:, 1] = np.sin(2*t)
X += 0.1*np.random.randn(NPoints, 2)
plt.scatter(X[:, 0], X[:, 1])


tic = time.time()
resultfull = ripser(X)
toc = time.time()
timefull = toc-tic
print("Elapsed Time: %.3g seconds, %i Edges added"%(timefull, resultfull['num_edges']))

Now let's run an approximate version and plot H1 for both side by side.

Questions

  • What happens to the approximation when you change $\epsilon$? What happens to the runtime? (See the sweep sketch after the plots below.)

In [ ]:
eps = 0.05


#Compute the sparse filtration
tic = time.time()
#First compute all pairwise distances and do furthest point sampling
D = pairwise_distances(X, metric='euclidean')
lambdas = getGreedyPerm(D)
#Now compute the sparse distance matrix
DSparse = getApproxSparseDM(lambdas, eps, D)
#Finally, compute the filtration
resultsparse = ripser(DSparse, distance_matrix=True)
toc = time.time()
timesparse = toc-tic
percent_added = 100.0*float(resultsparse['num_edges'])/resultfull['num_edges']
print("Elapsed Time: %.3g seconds, %i Edges added"%(timesparse, resultsparse['num_edges']))


#Plot the sparse distance matrix and edges that were added
plt.figure(figsize=(10, 5))
plt.subplot(121)
D = pairwise_distances(X, metric='euclidean') #Recompute D, since getApproxSparseDM modified it in place
plt.imshow(D)
plt.title("Original Distance Matrix: %i Edges"%resultfull['num_edges'])
plt.subplot(122)
DSparse = DSparse.toarray()
DSparse = DSparse + DSparse.T
plt.imshow(DSparse)
plt.title("Sparse Distance Matrix: %i Edges"%resultsparse['num_edges'])



#And plot the persistence diagrams on top of each other
plt.figure(figsize=(10, 5))
plt.subplot(121)
plot_diagrams(resultfull['dgms'], show=False)
plt.title("Full Filtration: Elapsed Time %g Seconds"%timefull)
plt.subplot(122)
plt.title("Sparse Filtration (%.3g%% Added)\nElapsed Time %g Seconds"%(percent_added, timesparse))
plot_diagrams(resultsparse['dgms'], show=False)
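
To explore the question posed above, one can sweep over several values of $\epsilon$ and compare edge counts and runtimes against the full filtration. Below is a minimal sketch; the particular $\epsilon$ values are arbitrary choices.


In [ ]:
#Sweep a few eps values and compare to the full filtration computed above
for eps_i in [0.01, 0.1, 0.5, 1.0]:
    tic = time.time()
    Di = pairwise_distances(X, metric='euclidean')
    lambdas_i = getGreedyPerm(Di)
    DSparse_i = getApproxSparseDM(lambdas_i, eps_i, Di)
    res = ripser(DSparse_i, distance_matrix=True)
    toc = time.time()
    print("eps = %.3g: %.3g seconds, %i edges (%.3g%% of full)"%(
        eps_i, toc - tic, res['num_edges'],
        100.0*res['num_edges']/resultfull['num_edges']))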

We can also compute the Wasserstein distance between the two diagrams to see how close they are. Try this with different approximations above; a sketch automating the sweep follows after the cell below.


In [ ]:
#Show the Wasserstein distance for H1
I1 = resultfull['dgms'][1]
I2 = resultsparse['dgms'][1]
matchdist, (matchidx, D) = wasserstein(I1, I2, matching=True)
plt.figure(figsize=(5, 5))
#Plot wasserstein matching
wasserstein_matching(I1, I2, matchidx, labels=['exact', '$\\epsilon = %.3g$'%eps])
plt.title("Wasserstein Dist: %.3g"%matchdist)
